```r
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
```
In this assignment, the sales of several products sold on the e-commerce platform Trendyol will be forecast. The sold count of each product will be examined and the data will be decomposed. Several forecasting strategies will then be developed, and the best of them will be selected according to its weighted mean absolute error. The data up to 23 June 2021 form the training set for the models, and the data from 24 June to 30 June 2021 form the test set. Nine products will be examined:
Before building alternative models, the data should be plotted and the seasonality and trend examined. First of all, the variance of the data is clearly very large, so from here on the logarithm of the sold count is used. Below you can see the plot of the actual values and the plot of the log-transformed values. There is a slightly increasing trend, especially in the middle of the plot, and no significant seasonality is visible. To look further, three months of 2021 (March, April and May) are plotted. Again, the seasonality is not very pronounced, but the data are higher at the beginning of each month and decrease toward the end of the month, so a monthly seasonality can be claimed.
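As a minimal sketch of this transformation (the data frame `product` and its columns are illustrative placeholders, since the report's actual object names are not echoed):

```r
# Hypothetical daily series standing in for one product's sold counts;
# the real report reads these from the Trendyol data set.
set.seed(1)
product <- data.frame(
  date       = seq(as.Date("2021-03-01"), as.Date("2021-06-30"), by = "day"),
  sold_count = rpois(122, lambda = 200)
)

# Log transform to shrink the large variance; +1 guards against log(0).
product$log_sold <- log(product$sold_count + 1)

par(mfrow = c(2, 1))
plot(product$date, product$sold_count, type = "l", ylab = "sold count")
plot(product$date, product$log_sold,  type = "l", ylab = "log(sold count + 1)")
```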
Before decomposing, to make a better decision, the autocorrelation plot of the data can be examined. Below, a spike can be seen at lag 63. Since the seasonality was not very pronounced, the frequency is determined as 63 based on this spike.
Now the data will be decomposed in order to obtain a random series. The additive type of decomposition will be used because the variance of the data is low after switching to the logarithm. Plots of the deseasonalized series, the random series and their autocorrelations can be seen below.
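The decomposition step can be sketched as follows, assuming a log series `log_sold` (a placeholder here) and the frequency of 63 found from the ACF:

```r
# Placeholder log series; 378 = 6 full periods of the trial frequency 63.
set.seed(2)
log_sold <- log(rpois(378, lambda = 200) + 1)

# Additive decomposition at the frequency read off the ACF spike.
ts_data <- ts(log_sold, frequency = 63)
dec     <- decompose(ts_data, type = "additive")

deseason <- ts_data - dec$seasonal   # deseasonalized series
random   <- dec$random               # remainder ("random") series

plot(dec)
acf(random, na.action = na.pass)     # the remainder has NAs at both ends
```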
For the ARIMA model, the (p, d, q) orders should be chosen. For this purpose, the ACF and PACF plots can be examined. Looking at the ACF, q = 1 can be chosen; looking at the PACF, p = 1 or p = 4 can be chosen.
The AIC and BIC values of the suggested models can be seen below; the auto.arima function is used as well. Smaller AIC and BIC values mean a better model, so the (1,0,0) model suggested by auto.arima is the best among them, and we proceed with it.
##
## Call:
## arima(x = detrend, order = c(1, 0, 1))
##
## Coefficients:
## ar1 ma1 intercept
## 0.6310 0.0965 0.0000
## s.e. 0.0598 0.0745 0.0579
##
## sigma^2 estimated as 0.1279: log likelihood = -130.47, aic = 268.94
## [1] 268.9444
## [1] 284.177
##
## Call:
## arima(x = detrend, order = c(4, 0, 1))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 intercept
## 0.7443 -0.0417 0.0122 -0.1188 -0.0239 -0.0001
## s.e. 0.3478 0.2596 0.0682 0.0610 0.3486 0.0468
##
## sigma^2 estimated as 0.1252: log likelihood = -126.94, aic = 267.88
## [1] 267.8753
## [1] 294.5323
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,0,2) with non-zero mean : 268.7823
## ARIMA(0,0,0) with non-zero mean : 475.4002
## ARIMA(1,0,0) with non-zero mean : 268.5768
## ARIMA(0,0,1) with non-zero mean : 325.494
## ARIMA(0,0,0) with zero mean : 473.3897
## ARIMA(2,0,0) with non-zero mean : 268.8005
## ARIMA(1,0,1) with non-zero mean : 269.0137
## ARIMA(2,0,1) with non-zero mean : 267.8519
## ARIMA(3,0,1) with non-zero mean : Inf
## ARIMA(1,0,2) with non-zero mean : 270.2874
## ARIMA(3,0,0) with non-zero mean : 269.9104
## ARIMA(3,0,2) with non-zero mean : Inf
## ARIMA(2,0,1) with zero mean : 265.8127
## ARIMA(1,0,1) with zero mean : 266.9701
## ARIMA(2,0,0) with zero mean : 266.7685
## ARIMA(3,0,1) with zero mean : Inf
## ARIMA(2,0,2) with zero mean : 266.7251
## ARIMA(1,0,0) with zero mean : 266.5464
## ARIMA(1,0,2) with zero mean : 268.23
## ARIMA(3,0,0) with zero mean : 267.8628
## ARIMA(3,0,2) with zero mean : Inf
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(2,0,1) with zero mean : Inf
## ARIMA(1,0,0) with zero mean : 266.6784
##
## Best model: ARIMA(1,0,0) with zero mean
## Series: detrend
## ARIMA(1,0,0) with zero mean
##
## Coefficients:
## ar1
## 0.6819
## s.e. 0.0399
##
## sigma^2 estimated as 0.129: log likelihood=-131.32
## AIC=266.64 AICc=266.68 BIC=274.26
## [1] 266.642
## [1] 274.2583
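The order selection above can be reproduced along these lines; `random` stands in for the decomposed remainder, and `auto.arima` comes from the `forecast` package:

```r
library(forecast)  # for auto.arima

# Placeholder remainder series with AR(1)-like behaviour.
set.seed(3)
random <- arima.sim(model = list(ar = 0.65), n = 360)

# Candidate orders read off the ACF/PACF plots.
fit_101 <- arima(random, order = c(1, 0, 1))
fit_401 <- arima(random, order = c(4, 0, 1))

# Stepwise search over orders; trace = TRUE prints each candidate's AICc.
fit_auto <- auto.arima(random, trace = TRUE)

# Smaller AIC/BIC indicates the better model.
sapply(list(arma101 = fit_101, arma401 = fit_401, auto = fit_auto), AIC)
sapply(list(arma101 = fit_101, arma401 = fit_401, auto = fit_auto), BIC)
```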
Below, the fitted values of the final model are compared with the training set to see how well the model has learned the data. There are plots of the random series, of the logarithm of the actual series and of the actual series. The brown lines belong to the fitted model and the light blue lines to the actual series.
The model can be improved by adding regressors. To decide which regressors should be added, the correlation matrix of the different attributes should be examined. Looking at the correlations with the logarithm of the sold count, the category_favored and basket_count attributes should be chosen.
The chosen regressors are added to the model. The new model's AIC and BIC values are much lower, so the model is better and we proceed with it. The fitted values on the training set are also better, as can be seen in the plots: these fits are closer than those of the previously chosen model.
##
## Call:
## arima(x = detrend, order = c(1, 0, 0), xreg = xreg)
##
## Coefficients:
## ar1 intercept xreg1 xreg2
## 0.7532 -0.6575 0.0015 0
## s.e. 0.0383 0.0254 0.0002 NaN
##
## sigma^2 estimated as 0.07767: log likelihood = -47.46, aic = 104.92
## [1] 104.9199
## [1] 123.9606
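A sketch of the regressor step: pick the attributes most correlated with the log sold count, then pass them to `arima` via `xreg`. The data frame and its columns here are illustrative placeholders.

```r
# Hypothetical attribute table; the report's real columns come from the data set.
set.seed(4)
n <- 300
product <- data.frame(
  log_sold         = rnorm(n),
  category_favored = rnorm(n),
  basket_count     = rnorm(n),
  price            = rnorm(n)
)

# Correlation matrix used to choose the regressors.
round(cor(product), 2)

# Regression with ARIMA(1,0,0) errors on the chosen attributes.
xreg <- as.matrix(product[, c("category_favored", "basket_count")])
fit  <- arima(product$log_sold, order = c(1, 0, 0), xreg = xreg)
c(AIC = AIC(fit), BIC = BIC(fit))
```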
The predictions are made with the final model. The predicted values for the test set can be seen below. The mean absolute error for each day is plotted, and the weighted mean absolute percentage error of the prediction is shown as well.
## [1] " Weighted Mean Absolute Percentage Error : 33.0179407746578"
The plot of the data should be examined first for seasonality and trend. The variance of the data is clearly very large, so from here on the logarithm of the sold count is used. Below you can see the plot of the actual values and the plot of the log-transformed values. There is a slightly increasing trend, especially at the end of the plot, and no significant seasonality is visible. To look further, three months of 2021 (March, April and May) are plotted. Again, the seasonality is not significant, though there is a spike at the beginning of each month. In May there is a large rise, probably due to Covid-19 conditions. In conclusion, there is a monthly seasonality, but it is not very clear.
Before decomposing, to make a better decision, the autocorrelation plot of the data can be examined. Below, a spike can be seen at lag 27. Since the seasonality was not very pronounced, the frequency is determined as 27 based on this spike.
Now the data will be decomposed in order to obtain a random series. The additive type of decomposition will be used because the variance of the data is low after switching to the logarithm. Plots of the deseasonalized series, the random series and their autocorrelations can be seen below.
For the ARIMA model, the (p, d, q) orders should be chosen. For this purpose, the ACF and PACF plots can be examined. Looking at the ACF, q = 2 or q = 10 can be chosen; looking at the PACF, p = 2 can be chosen.
The AIC and BIC values of the suggested models can be seen below; the auto.arima function is used as well. Looking at the AIC and BIC values, the (2,0,2) model suggested above is the best among them, and we proceed with it.
##
## Call:
## arima(x = detrend2, order = c(2, 0, 2))
##
## Coefficients:
## ar1 ar2 ma1 ma2 intercept
## 1.6173 -0.7246 -0.7336 -0.2664 0.0002
## s.e. 0.0441 0.0436 0.0670 0.0667 0.0022
##
## sigma^2 estimated as 0.1442: log likelihood = -169.07, aic = 350.13
## [1] 350.1308
## [1] 373.5956
##
## Call:
## arima(x = detrend2, order = c(2, 0, 10))
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 ma4 ma5 ma6
## 1.5860 -0.8114 -0.7230 -0.1593 0.1764 -0.0368 0.0068 0.0767
## s.e. 0.0961 0.0941 0.1106 0.0677 0.0759 0.0678 0.0691 0.0720
## ma7 ma8 ma9 ma10 intercept
## 0.0054 -0.1594 -0.0731 -0.1137 0.0003
## s.e. 0.0634 0.0626 0.0650 0.0726 0.0025
##
## sigma^2 estimated as 0.1392: log likelihood = -162.61, aic = 353.23
## [1] 353.2274
## [1] 407.9785
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,0,2) with non-zero mean : Inf
## ARIMA(0,0,0) with non-zero mean : 760.1106
## ARIMA(1,0,0) with non-zero mean : 443.9374
## ARIMA(0,0,1) with non-zero mean : 495.5242
## ARIMA(0,0,0) with zero mean : 758.0933
## ARIMA(2,0,0) with non-zero mean : 397.8918
## ARIMA(3,0,0) with non-zero mean : 400.2801
## ARIMA(2,0,1) with non-zero mean : 397.8625
## ARIMA(1,0,1) with non-zero mean : 403.3121
## ARIMA(3,0,1) with non-zero mean : 401.3931
## ARIMA(1,0,2) with non-zero mean : 403.9694
## ARIMA(3,0,2) with non-zero mean : Inf
## ARIMA(2,0,1) with zero mean : 395.8077
## ARIMA(1,0,1) with zero mean : 401.2683
## ARIMA(2,0,0) with zero mean : 395.8478
## ARIMA(3,0,1) with zero mean : 399.3263
## ARIMA(2,0,2) with zero mean : Inf
## ARIMA(1,0,0) with zero mean : 441.9045
## ARIMA(1,0,2) with zero mean : 401.9144
## ARIMA(3,0,0) with zero mean : 398.2249
## ARIMA(3,0,2) with zero mean : Inf
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(2,0,1) with zero mean : 395.0582
##
## Best model: ARIMA(2,0,1) with zero mean
## Series: detrend2
## ARIMA(2,0,1) with zero mean
##
## Coefficients:
## ar1 ar2 ma1
## 1.3194 -0.5680 -0.3469
## s.e. 0.2052 0.1465 0.2563
##
## sigma^2 estimated as 0.1679: log likelihood=-193.47
## AIC=394.95 AICc=395.06 BIC=410.59
## [1] 394.9483
## [1] 410.5915
Below, the fitted values of the final model are compared with the training set to see how well the model has learned the data. There are plots of the random series, of the logarithm of the actual series and of the actual series. The brown lines belong to the fitted model and the light blue lines to the actual series.
The model can be improved by adding regressors. To decide which regressors should be added, the correlation matrix of the different attributes should be examined. Looking at the correlations with the logarithm of the sold count, the category_sold and basket_count attributes should be chosen.
The chosen regressors are added to the model. The new model's AIC and BIC values are much lower, so the model is better and we proceed with it. The fitted values on the training set are also better, as can be seen in the plots: these fits are closer than those of the previously chosen model.
##
## Call:
## arima(x = detrend2, order = c(2, 0, 2), xreg = xreg2)
##
## Coefficients:
## ar1 ar2 ma1 ma2 intercept xreg21 xreg22
## 0.0493 0.4027 0.7695 0.1611 -0.5382 3e-04 1e-04
## s.e. 0.2893 0.2104 0.2899 0.0723 0.0077 1e-04 0e+00
##
## sigma^2 estimated as 0.0749: log likelihood = -45.87, aic = 107.74
## [1] 107.7449
## [1] 139.0313
The predictions are made with the final model. The predicted values for the test set can be seen below. The mean absolute error for each day is plotted, and the weighted mean absolute percentage error of the prediction is shown as well.
## [1] " Weighted Mean Absolute Percentage Error : 26.774801709638"
First of all, the variance of the data is clearly very large, so from here on the logarithm of the sold count is used. Below you can see the plot of the actual values and the plot of the log-transformed values. Looking at the line graph of this product, the sales show a large variance, with peaks on some dates, and there may be a cyclical behaviour, which is an indicator of seasonality.
In order to proceed, the data should be decomposed, and for that a frequency must be chosen. Frequencies of 30 and 7 days can be selected and the data decomposed accordingly. In addition, the ACF plot of the data can be examined, and the lag with high autocorrelation can be taken as another trial frequency. Since the variance does not seem to be increasing, the additive type of decomposition can be used. Below, the random series can be seen.
The decomposition series above belong to the time series with 7-day and 30-day frequency, respectively.
Looking at the ACF plot of the series, the highest ACF value belongs to lag 32, so a time series decomposition with a 32-day frequency is tried as well.
In time series decomposition, the random part is assumed to be noise with mean zero and constant variance; to decide on the best frequency, the random parts of the decomposed series should be compared. In this case, the random part of the series decomposed with the 7-day frequency seems closest to such noise, so it is chosen as the final decomposition.
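Comparing the candidate frequencies by the behaviour of their random parts can be sketched as follows (`log_sold` is again a placeholder series):

```r
# Placeholder log series; length 420 covers whole periods of 7 and 30.
set.seed(6)
log_sold <- log(rpois(420, lambda = 150) + 1)

# Decompose at each trial frequency and inspect the remainder:
# the frequency whose remainder looks most like zero-mean noise wins.
for (f in c(7, 30, 32)) {
  r <- na.omit(decompose(ts(log_sold, frequency = f))$random)
  cat(sprintf("frequency %2d: mean = %+.4f, sd = %.4f\n", f, mean(r), sd(r)))
}
```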
After the decomposition, the (p, d, q) orders should be chosen for the model. For this task, the ACF and PACF plots are examined: peaks in the ACF suggest q values and peaks in the PACF suggest p values. Looking at the ACF, q = 3 or q = 4 may be selected; looking at the PACF, p = 3 may be selected. The auto.arima function is used as well. The AIC and BIC values of the suggested models can be seen below. Smaller AIC and BIC values mean a better model, so the (3,0,3) model built from the ACF and PACF observations is the best among them.
##
## Call:
## arima(x = detrend, order = c(1, 0, 1))
##
## Coefficients:
## ar1 ma1 intercept
## -0.0087 0.3157 0.0007
## s.e. 0.1050 0.0917 0.0164
##
## sigma^2 estimated as 0.06159: log likelihood = -9.89, aic = 27.78
## [1] 27.78484
## [1] 43.63916
##
## Call:
## arima(x = detrend, order = c(3, 0, 3))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 intercept
## 0.1673 0.3160 -0.4474 -0.1977 -0.7620 -0.0403 -1e-04
## s.e. 0.1826 0.2483 0.1531 0.1866 0.2702 0.1503 2e-04
##
## sigma^2 estimated as 0.04082: log likelihood = 67.22, aic = -118.44
## [1] -118.4387
## [1] -86.73004
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,0,2) with non-zero mean : -107.9538
## ARIMA(0,0,0) with non-zero mean : 58.78767
## ARIMA(1,0,0) with non-zero mean : 36.60636
## ARIMA(0,0,1) with non-zero mean : 25.80072
## ARIMA(0,0,0) with zero mean : 56.76901
## ARIMA(1,0,2) with non-zero mean : Inf
## ARIMA(2,0,1) with non-zero mean : -109.9478
## ARIMA(1,0,1) with non-zero mean : 28.34829
## ARIMA(2,0,0) with non-zero mean : 4.567387
## ARIMA(3,0,1) with non-zero mean : Inf
## ARIMA(3,0,0) with non-zero mean : -33.54895
## ARIMA(3,0,2) with non-zero mean : Inf
## ARIMA(2,0,1) with zero mean : -111.095
## ARIMA(1,0,1) with zero mean : 26.31239
## ARIMA(2,0,0) with zero mean : 2.52807
## ARIMA(3,0,1) with zero mean : Inf
## ARIMA(2,0,2) with zero mean : -109.1481
## ARIMA(1,0,0) with zero mean : 34.58137
## ARIMA(1,0,2) with zero mean : -68.55153
## ARIMA(3,0,0) with zero mean : -35.59868
## ARIMA(3,0,2) with zero mean : Inf
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(2,0,1) with zero mean : Inf
## ARIMA(2,0,1) with non-zero mean : Inf
## ARIMA(2,0,2) with zero mean : Inf
## ARIMA(2,0,2) with non-zero mean : Inf
## ARIMA(1,0,2) with zero mean : Inf
## ARIMA(3,0,0) with zero mean : -37.5118
##
## Best model: ARIMA(3,0,0) with zero mean
## Series: detrend
## ARIMA(3,0,0) with zero mean
##
## Coefficients:
## ar1 ar2 ar3
## 0.2286 -0.1907 -0.315
## s.e. 0.0481 0.0485 0.048
##
## sigma^2 estimated as 0.0524: log likelihood=22.81
## AIC=-37.62 AICc=-37.51 BIC=-21.76
## [1] -37.61596
## [1] -21.76165
Below, the fitted values of the final model are compared with the training set to see how well the model has learned the data. There are plots of the random series, of the logarithm of the actual series and of the actual series. The brown lines belong to the fitted model and the light blue lines to the actual series.
The fitted values have partly captured the behaviour of the series; however, the predictions seem to overshoot at the peaks of the original data. The model can be improved by adding regressors. To decide which regressors should be added, the correlation matrix of the different attributes should be examined. Looking at the correlations with the logarithm of the sold count, the category_sold and visit_count attributes should be chosen.
##
## Call:
## arima(x = detrend, order = c(3, 0, 3), xreg = xreg)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 intercept xreg1
## 0.1639 0.3118 -0.4518 -0.2009 -0.7632 -0.0359 -0.0267 0
## s.e. 0.1817 0.2405 0.1263 0.1865 0.2598 0.1219 NaN NaN
## xreg2
## 0
## s.e. NaN
##
## sigma^2 estimated as 0.04005: log likelihood = 70.91, aic = -121.82
## [1] -121.8218
## [1] -82.18603
The chosen regressors are added to the model. The new model's AIC and BIC values are lower, so the model is better and we proceed with it.
The predictions are made with the final model. The predicted values for the test set can be seen below. The mean absolute error for each day is plotted, and the weighted mean absolute percentage error of the prediction is shown as well.
## Time Series:
## Start = c(1, 1)
## End = c(1, 7)
## Frequency = 7
## [1] 560.3584 522.0930 483.4829 465.1576 526.3092 548.6215 554.8162
## [1] " Weighted Mean Absolute Percentage Error : 74.2426820017771"
Looking at the Predictions vs Actual Sales plot, the model has captured the behaviour of the data, but there is an almost constant difference between the predictions and the actual sales, which is probably why the WMAPE is high. Further investigation would be needed to resolve this issue.
First of all, the variance of the data is clearly very large, so from here on the logarithm of the sold count is used. Below you can see the plot of the actual values and the plot of the log-transformed values. There is no significant trend. There may be a seasonality; to look further, three months of 2021 (March, April and May) are plotted. The seasonality is not easily observed, though there is a spike in the plot at the end of each month. In May there is a large rise, probably due to Covid-19 conditions. In conclusion, there is a monthly seasonality, but it is not very clear.
Frequencies of 30 and 7 days can be selected and the data decomposed accordingly. In addition, the ACF plot of the data can be examined, and the lag with high autocorrelation can be taken as another trial frequency. Since the variance does not seem to be increasing, the additive type of decomposition can be used. Below, the random series can be seen.
The decomposition series above belong to the time series with 7-day and 30-day frequency, respectively.
Looking at the ACF plot of the series, the highest ACF value belongs to lag 35, so a time series decomposition with a 35-day frequency is tried as well.
In this case, the random part of the series decomposed with the 35-day frequency seems closest to zero-mean noise, so it is chosen as the final decomposition.
Looking at the ACF, q = 1 or q = 3 may be selected; looking at the PACF, p = 1 may be selected. The auto.arima function is used as well. The AIC and BIC values of the suggested models can be seen below. Looking at the AIC and BIC values, the (1,0,3) model built from the ACF and PACF observations is the best among them; it is also the same model that auto.arima suggests.
##
## Call:
## arima(x = detrend, order = c(1, 0, 1))
##
## Coefficients:
## ar1 ma1 intercept
## 0.6785 -0.0670 0.0078
## s.e. 0.0548 0.0696 0.0539
##
## sigma^2 estimated as 0.1261: log likelihood = -138.75, aic = 285.51
## [1] 285.5067
## [1] 301.0623
##
## Call:
## arima(x = detrend, order = c(1, 0, 3))
##
## Coefficients:
## ar1 ma1 ma2 ma3 intercept
## 0.3775 0.1972 0.2890 0.2218 0.0078
## s.e. 0.1158 0.1105 0.0751 0.0663 0.0502
##
## sigma^2 estimated as 0.1216: log likelihood = -132.26, aic = 276.52
## [1] 276.5232
## [1] 299.8564
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,0,2) with non-zero mean : 270.3389
## ARIMA(0,0,0) with non-zero mean : 472.5668
## ARIMA(1,0,0) with non-zero mean : 284.9426
## ARIMA(0,0,1) with non-zero mean : 348.2958
## ARIMA(0,0,0) with zero mean : 470.6423
## ARIMA(1,0,2) with non-zero mean : 284.6
## ARIMA(2,0,1) with non-zero mean : 288.3505
## ARIMA(3,0,2) with non-zero mean : 277.7656
## ARIMA(2,0,3) with non-zero mean : 278.0155
## ARIMA(1,0,1) with non-zero mean : 286.0879
## ARIMA(1,0,3) with non-zero mean : 277.0941
## ARIMA(3,0,1) with non-zero mean : 273.6461
## ARIMA(3,0,3) with non-zero mean : 278.5149
## ARIMA(2,0,2) with zero mean : 268.5336
## ARIMA(1,0,2) with zero mean : 282.5616
## ARIMA(2,0,1) with zero mean : 286.312
## ARIMA(3,0,2) with zero mean : 275.702
## ARIMA(2,0,3) with zero mean : 275.9444
## ARIMA(1,0,1) with zero mean : 284.0618
## ARIMA(1,0,3) with zero mean : 275.0447
## ARIMA(3,0,1) with zero mean : 271.7115
## ARIMA(3,0,3) with zero mean : 276.4253
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(2,0,2) with zero mean : Inf
## ARIMA(2,0,2) with non-zero mean : Inf
## ARIMA(3,0,1) with zero mean : Inf
## ARIMA(3,0,1) with non-zero mean : Inf
## ARIMA(1,0,3) with zero mean : 274.7166
##
## Best model: ARIMA(1,0,3) with zero mean
## Series: detrend
## ARIMA(1,0,3) with zero mean
##
## Coefficients:
## ar1 ma1 ma2 ma3
## 0.3771 0.1976 0.2893 0.2221
## s.e. 0.1159 0.1105 0.0751 0.0662
##
## sigma^2 estimated as 0.123: log likelihood=-132.27
## AIC=274.55 AICc=274.72 BIC=293.99
## [1] 274.5476
## [1] 293.9919
Below, the fitted values of the final model are compared with the training set to see how well the model has learned the data. There are plots of the random series, of the logarithm of the actual series and of the actual series. The brown lines belong to the fitted model and the light blue lines to the actual series.
The fitted values have partly captured the behaviour of the series; however, the predictions seem to overshoot at the peaks of the original data. The model can be improved by adding regressors. To decide which regressors should be added, the correlation matrix of the different attributes should be examined. Looking at the correlations with the logarithm of the sold count, the category_favored and price attributes should be chosen.
The chosen regressors are added to the model. The new model's AIC and BIC values are much lower, so the model is better and we proceed with it.
##
## Call:
## arima(x = detrend, order = c(1, 0, 3), xreg = xreg)
##
## Coefficients:
## ar1 ma1 ma2 ma3 intercept xreg1 xreg2
## 0.3379 0.0382 0.2030 0.2168 1.9104 1e-04 -0.0088
## s.e. 0.1498 0.1440 0.0704 0.0659 0.3527 NaN 0.0012
##
## sigma^2 estimated as 0.09467: log likelihood = -86.96, aic = 189.92
## [1] 189.9197
## [1] 221.0307
The predictions are made with the final model. The predicted values for the test set can be seen below. The mean absolute error for each day is plotted, and the weighted mean absolute percentage error of the prediction is shown as well.
## Time Series:
## Start = c(1, 1)
## End = c(1, 7)
## Frequency = 7
## [1] 13.01960 11.29562 12.22475 12.54396 13.87283 15.89303 17.46858
## [1] " Weighted Mean Absolute Percentage Error : 40.265400020764"
Looking at the Predictions vs Actual Sales plot, the model has partly captured the behaviour of the data, but the fit is not the best. The plot above also shows a peak in the mean absolute errors on June 25; this may be an outlier in the data and may be the reason why the WMAPE is high. Further investigation would be needed to resolve this issue.
First of all, the variance of the data is clearly very large, so from here on the logarithm of the sold count is used. Below you can see the plot of the actual values and the plot of the log-transformed values. There is a decreasing trend. There may be a seasonality; to look further, three months of 2021 (March, April and May) are plotted. The seasonality is not easily observed, though there is a spike in the middle of each month. In May there is a large rise, probably due to Covid-19 conditions. In conclusion, there is a monthly seasonality, but it is not very clear. Frequencies of 30 and 7 days can be selected and the data decomposed accordingly. In addition, the ACF plot of the data can be examined, and the lag with high autocorrelation can be taken as another trial frequency. Since the variance does not seem to be increasing, the additive type of decomposition can be used. Below, the random series can be seen.
The decomposition series above belong to the time series with 7-day and 30-day frequency, respectively.
Looking at the ACF plot of the series, the highest ACF value belongs to lag 62, so a time series decomposition with a 62-day frequency is tried as well.
In this case, the random part of the series decomposed with the 7-day frequency seems closest to zero-mean noise, so it is chosen as the final decomposition.
Looking at the ACF, q = 1, 3 or 5 may be selected; looking at the PACF, p = 2 may be selected. The auto.arima function is used as well. The AIC and BIC values of the suggested models can be seen below. Looking at the AIC and BIC values, the (2,0,5) model built from the ACF and PACF observations is the best among them; its AIC is smaller than that of the (1,0,1) model suggested by auto.arima.
##
## Call:
## arima(x = detrend, order = c(2, 0, 1))
##
## Coefficients:
## ar1 ar2 ma1 intercept
## 1.1647 -0.6177 -1.0000 1e-04
## s.e. 0.0400 0.0403 0.0066 3e-04
##
## sigma^2 estimated as 0.07068: log likelihood = -39.67, aic = 89.33
## [1] 89.33336
## [1] 109.1513
##
## Call:
## arima(x = detrend, order = c(2, 0, 3))
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 intercept
## 1.4805 -0.6927 -1.4071 -0.0561 0.4632 1e-04
## s.e. 0.0410 0.0400 0.0488 0.0822 0.0450 1e-04
##
## sigma^2 estimated as 0.06498: log likelihood = -24.9, aic = 63.8
## [1] 63.79951
## [1] 91.54457
##
## Call:
## arima(x = detrend, order = c(2, 0, 5))
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 ma4 ma5 intercept
## 1.4236 -0.6318 -1.2980 -0.1782 0.3118 0.2256 -0.0611 1e-04
## s.e. 0.0986 0.0813 0.1137 0.1112 0.1157 0.0907 0.0839 1e-04
##
## sigma^2 estimated as 0.06376: log likelihood = -21.24, aic = 60.48
## [1] 60.48229
## [1] 96.1545
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,0,2) with non-zero mean : 103.7667
## ARIMA(0,0,0) with non-zero mean : 319.1792
## ARIMA(1,0,0) with non-zero mean : 256.6807
## ARIMA(0,0,1) with non-zero mean : 225.066
## ARIMA(0,0,0) with zero mean : 317.1717
## ARIMA(1,0,2) with non-zero mean : 212.6558
## ARIMA(2,0,1) with non-zero mean : 123.8577
## ARIMA(3,0,2) with non-zero mean : Inf
## ARIMA(2,0,3) with non-zero mean : Inf
## ARIMA(1,0,1) with non-zero mean : 217.1248
## ARIMA(1,0,3) with non-zero mean : 144.7481
## ARIMA(3,0,1) with non-zero mean : Inf
## ARIMA(3,0,3) with non-zero mean : Inf
## ARIMA(2,0,2) with zero mean : 102.2443
## ARIMA(1,0,2) with zero mean : 210.6276
## ARIMA(2,0,1) with zero mean : 122.168
## ARIMA(3,0,2) with zero mean : Inf
## ARIMA(2,0,3) with zero mean : Inf
## ARIMA(1,0,1) with zero mean : 215.1369
## ARIMA(1,0,3) with zero mean : 143.1001
## ARIMA(3,0,1) with zero mean : Inf
## ARIMA(3,0,3) with zero mean : Inf
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(2,0,2) with zero mean : Inf
## ARIMA(2,0,2) with non-zero mean : Inf
## ARIMA(2,0,1) with zero mean : Inf
## ARIMA(2,0,1) with non-zero mean : Inf
## ARIMA(1,0,3) with zero mean : Inf
## ARIMA(1,0,3) with non-zero mean : Inf
## ARIMA(1,0,2) with zero mean : Inf
## ARIMA(1,0,2) with non-zero mean : Inf
## ARIMA(1,0,1) with zero mean : 221.9019
##
## Best model: ARIMA(1,0,1) with zero mean
## Series: detrend
## ARIMA(1,0,1) with zero mean
##
## Coefficients:
## ar1 ma1
## 0.0506 0.4942
## s.e. 0.0794 0.0627
##
## sigma^2 estimated as 0.1024: log likelihood=-107.92
## AIC=221.84 AICc=221.9 BIC=233.73
## [1] 221.8396
## [1] 233.7303
Below, the fitted values of the final model are compared with the training set to see how well the model has learned the data. There are plots of the random series, of the logarithm of the actual series and of the actual series. The brown lines belong to the fitted model and the light blue lines to the actual series.
The fitted values have captured the behaviour of the series nicely; however, on peak dates the predictions seem to overshoot the actual sales. The model can be improved by adding regressors. To decide which regressors should be added, the correlation matrix of the different attributes should be examined. Looking at the correlations with the logarithm of the sold count, the favored_count and category_sold attributes should be chosen.
The chosen regressors are added to the model, and we proceed with this model.
##
## Call:
## arima(x = detrend, order = c(2, 0, 5), xreg = xreg)
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 ma4 ma5 intercept
## 1.3592 -0.4360 -0.9423 -0.3583 0.1100 0.3248 0.0239 -0.1683
## s.e. 0.1489 0.1385 0.1508 0.1140 0.1135 0.0733 0.0861 0.0066
## xreg1 xreg2
## 0 1e-04
## s.e. NaN NaN
##
## sigma^2 estimated as 0.06791: log likelihood = -29.79, aic = 81.57
## [1] 81.57108
## [1] 125.1705
The predictions are made with the final model. The predicted values for the test set can be seen below. The mean absolute error for each day is plotted, and the weighted mean absolute percentage error of the prediction is shown as well.
## Time Series:
## Start = c(1, 1)
## End = c(1, 7)
## Frequency = 7
## [1] 265.7791 203.0056 191.3715 192.5693 209.1476 271.3931 284.5196
## [1] " Weighted Mean Absolute Percentage Error : 15.8553265197858"
Looking at the Predictions vs Actual Sales plot, the model has partly captured the behaviour of the data, but the fit is not the best. The plot above also shows peaks in the mean absolute errors on June 25 and June 27; seeing an error peak on June 25 for different products may indicate an unknown Trendyol campaign or event on that date. Further investigation would be needed to resolve this issue.
The plot of the data should be examined first for seasonality and trend. Missing sold-count values are filled with the mean of the data. The variance of the data is clearly very large, so from here on the logarithm of the sold count is used. Below you can see the plot of the actual values and the plot of the log-transformed values. There is a slightly increasing trend, especially at the beginning and end of the plot, and no significant seasonality is visible. To look further, three months of 2021 (March, April and May) are plotted. Again, the seasonality is not significant. In conclusion, there is no seasonality.
Before decomposing, the autocorrelation plot of the data can be examined. Below, a spike can be seen at lag 10. Since there was no significant seasonality, the frequency is determined as 10 based on this spike.
Now the data is decomposed to obtain a random series. The additive type of decomposition is used because the variance of the data is already low after switching to logarithms. Plots of the deseasonalized and random series, together with their autocorrelations, can be seen below.
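A minimal sketch of this step, assuming `sold` holds the daily sold counts and using the frequency of 10 read off the ACF:

```r
# Stabilise the variance with logs (+1 guards against zero counts), then decompose.
sold_ts <- ts(log(sold + 1), frequency = 10)
dec     <- decompose(sold_ts, type = "additive")

deseasonalized <- sold_ts - dec$seasonal  # seasonality removed
random         <- dec$random              # remainder after trend removal as well

acf(random, na.action = na.pass)          # autocorrelation of the random series
```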
For the ARIMA model, the (p, d, q) values must be chosen, so the ACF and PACF plots are inspected. The ACF suggests q = 3 or q = 10, and the PACF suggests p = 5.
The AIC and BIC values of the suggested models are shown below, together with the result of the auto.arima function. Comparing the AIC and BIC values, the (5,0,3) model suggested above is the best among them, so we proceed with it.
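The comparison can be sketched as follows; `detrend` is the remainder series used in the calls below, and the candidate orders come from the ACF/PACF reading.

```r
library(forecast)  # for auto.arima

# Candidate orders suggested by the ACF/PACF plots:
fit_a <- arima(detrend, order = c(5, 0, 10))
fit_b <- arima(detrend, order = c(5, 0, 3))

# Automated benchmark:
fit_auto <- auto.arima(detrend, trace = TRUE)

# Lower AIC/BIC is better.
sapply(list(fit_a, fit_b, fit_auto), function(m) c(AIC = AIC(m), BIC = BIC(m)))
```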
##
## Call:
## arima(x = detrend, order = c(5, 0, 10))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.1038 -0.6085 -0.0238 0.0208 -0.3801 -0.4228 0.2782 -0.4734
## s.e. 0.2650 0.2766 0.2537 0.2385 0.1981 0.2638 0.2787 0.2262
## ma4 ma5 ma6 ma7 ma8 ma9 ma10 intercept
## -0.3826 0.0892 -0.1551 -0.1419 0.0249 0.0155 0.1680 -2e-04
## s.e. 0.2512 0.3447 0.1415 0.0947 0.0959 0.0771 0.0874 2e-04
##
## sigma^2 estimated as 0.1485: log likelihood = -182.55, aic = 399.1
## [1] 399.1035
## [1] 466.3087
##
## Call:
## arima(x = detrend, order = c(5, 0, 3))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.5383 0.2324 -0.3555 0.0084 -0.0890 -0.8667 -0.435 0.3018
## s.e. NaN NaN NaN 0.0741 0.0641 NaN NaN NaN
## intercept
## -2e-04
## s.e. 2e-04
##
## sigma^2 estimated as 0.152: log likelihood = -186.81, aic = 393.62
## [1] 393.6233
## [1] 433.1557
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,0,2) with non-zero mean : 402.245
## ARIMA(0,0,0) with non-zero mean : 511.3063
## ARIMA(1,0,0) with non-zero mean : 514.1456
## ARIMA(0,0,1) with non-zero mean : 513.2568
## ARIMA(0,0,0) with zero mean : 509.2907
## ARIMA(1,0,2) with non-zero mean : Inf
## ARIMA(2,0,1) with non-zero mean : 418.6372
## ARIMA(3,0,2) with non-zero mean : Inf
## ARIMA(2,0,3) with non-zero mean : 403.2674
## ARIMA(1,0,1) with non-zero mean : 515.9668
## ARIMA(1,0,3) with non-zero mean : Inf
## ARIMA(3,0,1) with non-zero mean : 403.9175
## ARIMA(3,0,3) with non-zero mean : Inf
## ARIMA(2,0,2) with zero mean : 400.7951
## ARIMA(1,0,2) with zero mean : 419.113
## ARIMA(2,0,1) with zero mean : 417.3323
## ARIMA(3,0,2) with zero mean : Inf
## ARIMA(2,0,3) with zero mean : 401.841
## ARIMA(1,0,1) with zero mean : 513.9288
## ARIMA(1,0,3) with zero mean : Inf
## ARIMA(3,0,1) with zero mean : 402.7315
## ARIMA(3,0,3) with zero mean : Inf
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(2,0,2) with zero mean : Inf
## ARIMA(2,0,3) with zero mean : Inf
## ARIMA(2,0,2) with non-zero mean : Inf
## ARIMA(3,0,1) with zero mean : Inf
## ARIMA(2,0,3) with non-zero mean : Inf
## ARIMA(3,0,1) with non-zero mean : Inf
## ARIMA(2,0,1) with zero mean : Inf
## ARIMA(2,0,1) with non-zero mean : Inf
## ARIMA(1,0,2) with zero mean : Inf
## ARIMA(0,0,0) with zero mean : 509.2907
##
## Best model: ARIMA(0,0,0) with zero mean
## Series: detrend
## ARIMA(0,0,0) with zero mean
##
## sigma^2 estimated as 0.2187: log likelihood=-253.64
## AIC=509.28 AICc=509.29 BIC=513.23
## [1] 509.2802
## [1] 513.2335
Below, the fitted values of the final model are compared with the training set to see how well the model has learned the data: the plot of the random series, the plot of the logarithm of the actual series, and the plot of the actual series. The brown lines belong to the fitted model and the light blue lines to the actual series.
The model can be improved by adding regressors. To decide which regressor to add, the correlation matrix of the different attributes is examined. Looking at the correlations with the logarithm of the sold counts, only the basket_count attribute should be chosen.
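That screening step might look like this sketch; `product_data` is an assumed data frame holding the raw attributes.

```r
# Correlation of each candidate attribute with the log sold count.
vars <- c("price", "visit_count", "favored_count", "basket_count", "category_sold")
cors <- sapply(product_data[vars],
               function(x) cor(x, log(product_data$sold_count + 1),
                               use = "pairwise.complete.obs"))
round(cors, 2)  # attributes with the highest absolute correlation are kept
```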
The chosen regressors are added to the model. The new model's AIC and BIC values are not much lower; there is only a small difference. We therefore proceed with the previously chosen model, since the difference between them would be small.
##
## Call:
## arima(x = detrend, order = c(5, 0, 3), xreg = xreg)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 ma2 ma3
## 0.5109 0.4048 -0.4763 0.0248 -0.0671 -0.8330 -0.6177 0.4690
## s.e. 0.2497 0.2829 0.1680 0.0796 0.0767 0.2475 0.3326 0.2223
## intercept xreg
## -0.0112 1e-04
## s.e. NaN NaN
##
## sigma^2 estimated as 0.1517: log likelihood = -185.03, aic = 392.06
## [1] 392.062
## [1] 435.5477
Predictions are now made with the final model. The predicted values for the test set are shown below, together with a plot of the mean absolute error for each day and the weighted mean absolute percentage error of the prediction.
## [1] " Weighted Mean Absolute Percentage Error : 64.6525081574837"
Product 7 is the Oral-B Rechargeable Toothbrush. As a daily-routine product that is not tied to particular periods, its sales are not expected to rise with the seasons; still, some seasonality may appear due to economic conditions and customer purchasing habits.
The data covers approximately one year of sales information, so yearly seasonality cannot be examined, as only one period is included. The data is therefore examined at frequencies of 7, 14 and 30 days to see whether the day of the week, the fortnight or the day of the month shows a seasonal pattern.
The sales of product 7 over time are plotted below. The box-plot and histogram of sales show whether the distribution of the data differs across weekdays and months, and the ACF and PACF plots show whether there is autocorrelation in the data.
From the time graph and the ACF plot, it can be said that there is a trend in the data. The boxplot and histogram show a day and month effect: the distribution of the data differs. Finally, the ACF and PACF plots show high autocorrelation at lag 1 and lag 7.
A summary of the data is shown below to see its nature.
## price event_date product_content_id sold_count
## Min. :110.1 Min. :2020-05-25 Length:405 Min. : 0.00
## 1st Qu.:129.9 1st Qu.:2020-09-03 Class :character 1st Qu.: 20.00
## Median :136.3 Median :2020-12-13 Mode :character Median : 57.00
## Mean :135.3 Mean :2020-12-13 Mean : 94.91
## 3rd Qu.:141.6 3rd Qu.:2021-03-24 3rd Qu.:139.00
## Max. :165.9 Max. :2021-07-03 Max. :513.00
## NA's :9
## visit_count favored_count basket_count category_sold
## Min. : 0 Min. : 0 Min. : 0.0 Min. : 321
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 92.0 1st Qu.: 610
## Median : 0 Median : 175 Median : 240.0 Median : 802
## Mean : 2267 Mean : 356 Mean : 399.2 Mean :1008
## 3rd Qu.: 4265 3rd Qu.: 588 3rd Qu.: 578.0 3rd Qu.:1099
## Max. :15725 Max. :2696 Max. :2249.0 Max. :5557
##
## category_brand_sold category_visits ty_visits category_basket
## Min. : 0 Min. : 346 Min. : 1 Min. : 0
## 1st Qu.: 0 1st Qu.: 657 1st Qu.: 1 1st Qu.: 0
## Median : 693 Median : 880 Median : 1 Median : 0
## Mean : 2991 Mean : 3896 Mean : 44737307 Mean : 18591
## 3rd Qu.: 5354 3rd Qu.: 1349 3rd Qu.:102143446 3rd Qu.: 41265
## Max. :28944 Max. :59310 Max. :178545693 Max. :281022
##
## category_favored
## Min. : 1242
## 1st Qu.: 2476
## Median : 3298
## Mean : 4202
## 3rd Qu.: 4869
## Max. :44445
##
It can be seen that some attributes behave unrealistically. More than half of the category_basket values equal zero, which is not possible: basket_count and category_sold should be included in category_basket, and they are not zero when category_basket is. The ty_visits column always equals one before a particular date, which is unusual and unrealistic, as it implies Trendyol was visited only once a day. visit_count behaves similarly, even though it should be more inclusive; it is sometimes less than basket_count and sold_count.
By examining the correlation graph and the reliability of the data, price, visit_count, basket_count, category_basket, ty_visits and is_campaign are chosen as regressors.
When the ARIMA models are constructed, the auto.arima function is used, and it is re-run for every day. Seasonality is set to TRUE, and the frequency is determined as seven by inspecting the ACF and PACF plots.
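A rough sketch of that daily re-fitting loop; `train_sales`, `actuals` and `h_days` are assumed placeholders for the training series, the realised test values and the number of test days.

```r
library(forecast)

preds <- numeric(h_days)
for (i in seq_len(h_days)) {
  # Refit on everything available before day i; weekly seasonality assumed.
  fit         <- auto.arima(ts(train_sales, frequency = 7), seasonal = TRUE)
  preds[i]    <- as.numeric(forecast(fit, h = 1)$mean)
  train_sales <- c(train_sales, actuals[i])  # roll the window forward one day
}
```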
An additive model, a multiplicative model, and a linear regression model are used for decomposition to obtain stationary data.
## [1] "the additive model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.0069
## [1] "the multiplicative model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.2127
The additive model gives more stationary data; therefore, additive decomposition is used in model construction.
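The KPSS comparison can be reproduced roughly as below, using `ur.kpss` from the urca package; `sold` is an assumed sales vector.

```r
library(urca)

sold_ts  <- ts(sold, frequency = 7)
add_rem  <- na.omit(decompose(sold_ts, type = "additive")$random)
mult_rem <- na.omit(decompose(sold_ts, type = "multiplicative")$random)

# A smaller test statistic means less evidence against stationarity.
summary(ur.kpss(add_rem))
summary(ur.kpss(mult_rem))
```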
## Series: random
## ARIMA(0,0,1)(0,0,2)[7] with non-zero mean
##
## Coefficients:
## ma1 sma1 sma2 mean
## 0.3383 0.0762 -0.1026 -0.0266
## s.e. 0.0460 0.0509 0.0501 2.1997
##
## sigma^2 estimated as 1145: log likelihood=-1969.4
## AIC=3948.79 AICc=3948.94 BIC=3968.73
## [1] "the additive model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.0071
## [1] "the multiplicative model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.0196
The data is more stationary than with the previous decomposition, and the additive method is again the more stationary one.
## Series: random
## ARIMA(0,0,1)(0,0,1)[14] with non-zero mean
##
## Coefficients:
## ma1 sma1 mean
## 0.3419 -0.1142 -0.0267
## s.e. 0.0462 0.0494 2.0169
##
## sigma^2 estimated as 1149: log likelihood=-1970.53
## AIC=3949.05 AICc=3949.16 BIC=3965.01
## [1] "the additive model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.0179
## [1] "the multiplicative model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.1052
The Additive method gave more stationary data, therefore it is used in model construction.
## Series: random
## ARIMA(1,0,1)(0,0,2)[30] with zero mean
##
## Coefficients:
## ar1 ma1 sma1 sma2
## 0.5412 0.1878 -0.2182 -0.0726
## s.e. 0.0643 0.0739 0.0580 0.0593
##
## sigma^2 estimated as 1613: log likelihood=-1916.29
## AIC=3842.57 AICc=3842.73 BIC=3862.21
The best result obtained is the ARIMA(4,0,0)(0,0,1)[30] model, which has the lowest AIC.
The regressors determined above are used to improve the model.
##
## Call:
## arima(x = random, order = c(4, 0, 0), seasonal = c(0, 0, 1), xreg = xreg7)
##
## Coefficients:
## ar1 ar2 ar3 ar4 sma1 intercept price visit_count
## 0.6079 0.1697 0.0780 -0.0106 0.0739 87.5287 -1.0590 -0.0141
## s.e. 0.0560 0.0627 0.0608 0.0525 0.0514 43.4787 0.3013 0.0023
## basket_count category_basket ty_visits is_campaign
## 0.2308 2e-04 0 4.9606
## s.e. 0.0119 1e-04 NaN 6.0480
##
## sigma^2 estimated as 639.8: log likelihood = -1744.2, aic = 3514.41
The AIC decreased; the regressors improved the model.
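Assembling the regressor matrix and refitting might look like this sketch; the column names follow the attributes chosen above, while `product_data` and `random` are assumed objects.

```r
# The chosen regressors go into arima()'s xreg argument as a numeric matrix.
xreg7 <- as.matrix(product_data[, c("price", "visit_count", "basket_count",
                                    "category_basket", "ty_visits", "is_campaign")])

fit_reg <- arima(random, order = c(4, 0, 0), seasonal = c(0, 0, 1), xreg = xreg7)
AIC(fit_reg)  # compare against the model without regressors
```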
## event_date actual add_arima_forecasted reg_add_arima_forecasted
## 1: 2021-06-26 40 129.52970 121.7252
## 2: 2021-06-27 46 121.41935 112.3915
## 3: 2021-06-28 64 123.59512 116.0881
## 4: 2021-06-29 137 125.11548 119.9548
## 5: 2021-06-30 131 116.14164 115.6428
## 6: 2021-07-01 130 100.83371 102.4004
## 7: 2021-07-02 108 98.68134 101.2189
### Model evaluation
Comparing the WMAPE values, the error rates show that the ARIMA model with regressors gave the better results.
## model n mean sd CV FBias MAPE
## 1 add_arima_forecasted 7 93.71429 42.48417 0.4533372 -0.2428603 0.7599683
## RMSE MAD MADP WMAPE
## 1 51.48474 41.396 0.4417256 0.4417256
## model n mean sd CV FBias MAPE
## 1 reg_add_arima_forecasted 7 93.71429 42.48417 0.4533372 -0.2033868 0.6881487
## RMSE MAD MADP WMAPE
## 1 46.49749 38.14113 0.4069937 0.4069937
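The evaluation tables above can be produced with a helper of this shape (`accu` is an assumed re-implementation following the standard definitions of the listed statistics):

```r
# Forecast accuracy statistics for one model on the test window.
accu <- function(actual, forecasted, model) {
  err <- actual - forecasted
  data.frame(
    model = model,
    n     = length(actual),
    mean  = mean(actual),
    sd    = sd(actual),
    CV    = sd(actual) / mean(actual),
    FBias = sum(err) / sum(actual),
    MAPE  = mean(abs(err / actual)),   # becomes Inf when an actual value is zero
    RMSE  = sqrt(mean(err^2)),
    MAD   = mean(abs(err)),
    WMAPE = sum(abs(err)) / sum(actual)
  )
}
```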
The data covers approximately one year of sales information, so yearly seasonality cannot be examined, as only one period is included. The data is therefore examined at frequencies of 7, 14 and 30 days to see whether the day of the week, the fortnight or the day of the month shows a seasonal pattern.
The sales of product 8 over time are plotted below. The box-plot and histogram of sales show whether the distribution of the data differs across weekdays and months, and the ACF and PACF plots show whether there is autocorrelation in the data.
Product 8 is a jacket, and its sales increase in certain months; the temperature of the season can have an effect on it.
It can be seen that sales are zero most of the time; however, there is a huge increase in October.
The ACF and PACF of the data show significant autocorrelation at lag 1, lag 5, lag 7 and lag 20.
When the ARIMA models are constructed, the auto.arima function is used, and it is re-run for every day. Seasonality is set to TRUE, and the frequency is determined as seven by inspecting the ACF and PACF plots.
An additive model and a multiplicative model are used for decomposition to obtain stationary data.
Below, the data is shown decomposed by the additive and multiplicative methods.
## [1] "the additive model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.0089
## [1] "the multiplicative model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.069
The additive method gave the more stationary data (smaller KPSS statistic), so the model is constructed with the additive decomposition.
## Series: random
## ARIMA(0,0,0)(0,0,2)[7] with zero mean
##
## Coefficients:
## sma1 sma2
## 0.1981 -0.0968
## s.e. 0.0495 0.0494
##
## sigma^2 estimated as 6.751: log likelihood=-946.38
## AIC=1898.76 AICc=1898.82 BIC=1910.72
## [1] "the additive model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.0061
## [1] "the multiplicative model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.173
The additive model gave more stationary data; therefore, it is used for the model.
## Series: random
## ARIMA(1,0,1) with zero mean
##
## Coefficients:
## ar1 ma1
## -0.5945 0.7755
## s.e. 0.1097 0.0851
##
## sigma^2 estimated as 7.911: log likelihood=-958.2
## AIC=1922.4 AICc=1922.46 BIC=1934.3
## [1] "the additive model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.0114
## [1] "the multiplicative model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.2785
The additive model is used for model construction since it is more stationary.
## Series: random
## ARIMA(0,0,1)(0,0,1)[30] with zero mean
##
## Coefficients:
## ma1 sma1
## 0.2542 -0.1282
## s.e. 0.0510 0.0605
##
## sigma^2 estimated as 8.243: log likelihood=-926.9
## AIC=1859.79 AICc=1859.86 BIC=1871.57
ARIMA(0,0,1)(0,0,1)[30] with zero mean gives the best AIC result compared with frequencies 7 and 14; therefore, it is used for the predictions and the regressor model.
## price event_date product_content_id sold_count
## Min. : -1.0 Min. :2020-05-25 Length:405 Min. : 0.0000
## 1st Qu.:350.0 1st Qu.:2020-09-03 Class :character 1st Qu.: 0.0000
## Median :600.0 Median :2020-12-13 Mode :character Median : 0.0000
## Mean :559.3 Mean :2020-12-13 Mean : 0.9284
## 3rd Qu.:734.3 3rd Qu.:2021-03-24 3rd Qu.: 0.0000
## Max. :833.3 Max. :2021-07-03 Max. :52.0000
## NA's :303
## visit_count favored_count basket_count category_sold
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 16.0
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 45.0
## Mean : 27.24 Mean : 2.242 Mean : 5.83 Mean : 200.2
## 3rd Qu.: 3.00 3rd Qu.: 2.000 3rd Qu.: 5.00 3rd Qu.: 111.0
## Max. :516.00 Max. :37.000 Max. :247.00 Max. :3299.0
##
## category_brand_sold category_visits ty_visits category_basket
## Min. : 0 Min. : 367 Min. : 1 Min. : 0
## 1st Qu.: 0 1st Qu.: 1432 1st Qu.: 1 1st Qu.: 0
## Median : 6 Median : 5324 Median : 1 Median : 0
## Mean : 46247 Mean : 27767 Mean : 44737307 Mean : 353021
## 3rd Qu.: 94562 3rd Qu.: 9538 3rd Qu.:102143446 3rd Qu.: 464380
## Max. :259590 Max. :583672 Max. :178545693 Max. :3102147
##
## category_favored w_day mon is_campaign
## Min. : 2324 Min. :1.000 Min. : 1.000 Min. :0.00000
## 1st Qu.: 8618 1st Qu.:2.000 1st Qu.: 4.000 1st Qu.:0.00000
## Median : 24534 Median :4.000 Median : 6.000 Median :0.00000
## Mean : 33688 Mean :4.007 Mean : 6.464 Mean :0.08642
## 3rd Qu.: 50341 3rd Qu.:6.000 3rd Qu.: 9.000 3rd Qu.:0.00000
## Max. :244883 Max. :7.000 Max. :12.000 Max. :1.00000
##
The correlation of price, visit_count and basket_count with sales is high, and it is expected that these variables can be zero when sold_count is zero.
However, category_favored and ty_visits are not expected to be zero or one, so those values are replaced with the column mean.
Considering correlation and variable reliability, price, visit_count, basket_count and category_favored are selected as regressors. The monthly distribution graph also shows that mon can be an effective factor, so it is added as well.
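The mean replacement mentioned above might be implemented as in this sketch; `fix_mean` and `dt` are assumed names.

```r
# Replace the implausible placeholder values (0 or 1) with the mean of the rest.
fix_mean <- function(x, bad = c(0, 1)) {
  x[x %in% bad] <- mean(x[!x %in% bad])
  x
}

dt$ty_visits        <- fix_mean(dt$ty_visits)
dt$category_favored <- fix_mean(dt$category_favored)
```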
##
## Call:
## arima(x = random, order = c(0, 0, 1), seasonal = c(0, 0, 1), xreg = xreg8)
##
## Coefficients:
## ma1 sma1 intercept price visit_count basket_count
## 0.3798 0.0389 2.9376 -0.0029 -0.0040 0.1477
## s.e. 0.0308 0.0313 0.7698 0.0013 0.0019 0.0057
## category_favored mon
## 0 -0.1838
## s.e. NaN 0.0410
##
## sigma^2 estimated as 3.173: log likelihood = -748.72, aic = 1515.45
The AIC became smaller when the regressors were added, so they improved the model.
## event_date actual add_arima_forecasted reg_add_arima_forecasted
## 1: 2021-06-25 2 1 1
## 2: 2021-06-26 1 1 1
## 3: 2021-06-27 0 1 1
## 4: 2021-06-28 4 1 1
## 5: 2021-06-29 1 2 2
## 6: 2021-06-30 0 2 2
## 7: 2021-07-01 1 3 3
## Model Evaluation
## model n mean sd CV FBias MAPE RMSE
## 1 add_arima_forecasted 7 1.285714 1.380131 1.073435 -0.2222222 Inf 1.690309
## MAD MADP WMAPE
## 1 1.428571 1.111111 1.111111
## model n mean sd CV FBias MAPE
## 1 reg_add_arima_forecasted 7 1.285714 1.380131 1.073435 -0.2222222 Inf
## RMSE MAD MADP WMAPE
## 1 1.690309 1.428571 1.111111 1.111111
The ARIMA model without regressors gives better results when comparing WMAPE.
The data covers approximately one year of sales information, so yearly seasonality cannot be examined, as only one period is included. The data is therefore examined at frequencies of 7, 14 and 30 days to see whether the day of the week, the fortnight or the day of the month shows a seasonal pattern.
The sales of the product over time are plotted below. The box-plot and histogram of sales show whether the distribution of the data differs across weekdays and months, and the ACF and PACF plots show whether there is autocorrelation in the data.
Observing the graph below, the month effect is clearly visible. This is expected, since bikinis are worn in the hot seasons in Turkey. Moreover, examining the ACF and PACF plots, it can be said that there is a trend in the data and correlation at lag 1 and lag 7.
## price event_date product_content_id sold_count
## Min. :59.99 Min. :2020-05-25 Length:405 Min. : 0.00
## 1st Qu.:59.99 1st Qu.:2020-09-03 Class :character 1st Qu.: 0.00
## Median :59.99 Median :2020-12-13 Mode :character Median : 0.00
## Mean :60.11 Mean :2020-12-13 Mean : 18.35
## 3rd Qu.:59.99 3rd Qu.:2021-03-24 3rd Qu.: 3.00
## Max. :63.55 Max. :2021-07-03 Max. :286.00
## NA's :281
## visit_count favored_count basket_count category_sold
## Min. : 0 Min. : 0.0 Min. : 0.00 Min. : 20
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 132
## Median : 0 Median : 0.0 Median : 0.00 Median : 563
## Mean : 2457 Mean : 240.8 Mean : 88.64 Mean :1301
## 3rd Qu.: 589 3rd Qu.: 112.0 3rd Qu.: 19.00 3rd Qu.:1676
## Max. :45833 Max. :5011.0 Max. :1735.00 Max. :8099
##
## category_brand_sold category_visits ty_visits category_basket
## Min. : 0 Min. : 107 Min. : 1 Min. : 0
## 1st Qu.: 0 1st Qu.: 397 1st Qu.: 1 1st Qu.: 0
## Median : 2965 Median : 1362 Median : 1 Median : 0
## Mean : 14028 Mean : 82604 Mean : 44737307 Mean : 118415
## 3rd Qu.: 15079 3rd Qu.: 2871 3rd Qu.:102143446 3rd Qu.: 101167
## Max. :152168 Max. :1335060 Max. :178545693 Max. :1230833
##
## category_favored w_day mon is_campaign
## Min. : 628 Min. :1.000 Min. : 1.000 Min. :0.00000
## 1st Qu.: 2589 1st Qu.:2.000 1st Qu.: 4.000 1st Qu.:0.00000
## Median : 7843 Median :4.000 Median : 6.000 Median :0.00000
## Mean : 15287 Mean :4.007 Mean : 6.464 Mean :0.08642
## 3rd Qu.: 16401 3rd Qu.:6.000 3rd Qu.: 9.000 3rd Qu.:0.00000
## Max. :135551 Max. :7.000 Max. :12.000 Max. :1.00000
##
The price, category_sold, basket_count and category_favored attributes are more reliable and significantly correlated with the data. Although visit_count and favored_count are very highly correlated with the data, they are also correlated with basket_count, so they are not used as regressors. The monthly distribution graph also suggests that mon has an effect on sales, so it is included.
When the ARIMA models are constructed, the auto.arima function is used, and it is re-run for every day. Seasonality is set to TRUE, and the frequency is determined as seven by inspecting the ACF and PACF plots.
An additive model, a multiplicative model, and a linear regression model are used for decomposition to obtain stationary data.
## [1] "the additive model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.0082
## [1] "the multiplicative model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.083
The additive method gives more stationary results; therefore, it is used for the models.
## Series: random
## ARIMA(0,0,2)(0,0,2)[7] with zero mean
##
## Coefficients:
## ma1 ma2 sma1 sma2
## 0.0016 -0.2206 0.1231 0.1535
## s.e. 0.0700 0.0800 0.0532 0.0552
##
## sigma^2 estimated as 103.2: log likelihood=-1489.39
## AIC=2988.77 AICc=2988.93 BIC=3008.72
## [1] "the additive model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.0087
## [1] "the multiplicative model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.2302
The additive model gives more stationary results, and it is used for model construction.
## Series: random
## ARIMA(1,0,0)(0,0,2)[14] with zero mean
##
## Coefficients:
## ar1 sma1 sma2
## 0.4729 0.0831 -0.1172
## s.e. 0.0445 0.0509 0.0593
##
## sigma^2 estimated as 158.4: log likelihood=-1543.96
## AIC=3095.93 AICc=3096.03 BIC=3111.8
## [1] "the additive model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.028
## [1] "the multiplicative model"
##
## #######################################
## # KPSS Unit Root / Cointegration Test #
## #######################################
##
## The value of the test statistic is: 0.6099
The additive model gives more stationary data; therefore, it is used as the model data.
## Series: random
## ARIMA(1,0,0)(0,0,2)[30] with zero mean
##
## Coefficients:
## ar1 sma1 sma2
## 0.6941 -0.1337 -0.1976
## s.e. 0.0370 0.0670 0.1009
##
## sigma^2 estimated as 185.5: log likelihood=-1511.85
## AIC=3031.71 AICc=3031.81 BIC=3047.41
The ARIMA(0,0,1)(0,0,1)[30] model with zero mean gives the lower AIC value; therefore, it is used for the regressor model.
##
## Call:
## arima(x = random, order = c(0, 0, 1), seasonal = c(0, 0, 1), xreg = xreg8)
##
## Coefficients:
## ma1 sma1 intercept price visit_count basket_count
## 0.5569 -0.1643 0.3158 -0.0041 0.0413 0.0201
## s.e. 0.0343 0.0817 6.5663 0.0106 0.0179 0.0489
## category_favored mon
## 0e+00 0.0841
## s.e. 1e-04 0.3520
##
## sigma^2 estimated as 231.4: log likelihood = -1553.47, aic = 3124.94
The AIC value is lower than that of the plain ARIMA model, so the regressors improved it.
## event_date Actual add_arima_forecasted reg_add_arima_forecasted
## 1: 2021-06-25 20 33.04259 33.47467
## 2: 2021-06-26 27 28.24629 28.63710
## 3: 2021-06-27 20 31.74235 32.14315
## 4: 2021-06-28 26 32.17666 32.60195
## 5: 2021-06-29 19 28.66079 29.05578
## 6: 2021-06-30 20 25.90133 26.29765
## 7: 2021-07-01 14 24.36589 24.74094
## Model Evaluation
## model n mean sd CV FBias MAPE
## 1 add_arima_forecasted 7 20.85714 4.413184 0.211591 -0.3981911 0.4381313
## RMSE MAD MADP WMAPE
## 1 9.128483 8.305128 0.3981911 0.3981911
## model n mean sd CV FBias MAPE
## 1 reg_add_arima_forecasted 7 20.85714 4.413184 0.211591 -0.4174742 0.4581128
## RMSE MAD MADP WMAPE
## 1 9.497635 8.707319 0.4174742 0.4174742
The ARIMA model with no regressors gives the better result on the test data.
To find the best decomposition level and, with respect to that, the best ARIMA models for the different products, several decomposition levels were tried and selected; then ARIMA models were fitted and their performance measured on the test set, which consists of the dates from 24 June 2021 to 30 June 2021. A different model was selected for each product.
Since sales are affected by the overall state of the economy, more external data, such as the dollar exchange rate, could be included for improved accuracy.
Approaching each product individually is one of the strong sides of this work, even though it is a time-consuming task. Comparing the AIC values of the models suggested by auto.arima with the models selected from the ACF and PACF plots, and measuring their performance on the test data, is another strong side of the proposed models.
Overall, it can be said that the models work reasonably well; the deviation from the real values is not too large.
Lecture Notes
The code of my study is available from here